Create decoder for HTML entities #2563

Open
rgmz wants to merge 1 commit into main from feat/html-decoder



@rgmz rgmz commented Mar 10, 2024

Description:

This creates a decoder to handle HTML entities. Tests pass, but the implementation may not be the most efficient.

This fixes #2231.

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

@rgmz rgmz requested a review from a team as a code owner March 10, 2024 15:55

if matched {
	decodableChunk := &DecodableChunk{
		DecoderType: detectorspb.DecoderType_ESCAPED_UNICODE,

Should this be a new decoder type?

@dustin-decker

I think we've reached the point where we should consider adding a --enabled/disabled-decoders flag, similar to what we have for detectors. This one seems pretty impactful on performance in its current state, and many data sources might not benefit much from it.

One potential improvement might be to implement this as a handler and do identification of the whole file before decoding and chunking it out.
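As a rough illustration of the decoder toggle suggested above, here is a minimal Go sketch. The `Decoder` struct, the `filterDecoders` and `parseNames` helpers, the decoder names, and the comma-separated flag format are all hypothetical and not taken from TruffleHog; the sketch only shows how an enabled list and a disabled list might combine.

```go
package main

import (
	"fmt"
	"strings"
)

// Decoder is a hypothetical stand-in; only the name matters for this sketch.
type Decoder struct {
	Name string
}

// parseNames turns a comma-separated list into a lowercase set.
func parseNames(csv string) map[string]bool {
	names := make(map[string]bool)
	for _, n := range strings.Split(csv, ",") {
		if n = strings.TrimSpace(strings.ToLower(n)); n != "" {
			names[n] = true
		}
	}
	return names
}

// filterDecoders keeps the decoders named in enabled (all of them when
// enabled is empty), then drops any that are named in disabled.
func filterDecoders(all []Decoder, enabled, disabled string) []Decoder {
	include := parseNames(enabled)
	exclude := parseNames(disabled)
	var out []Decoder
	for _, d := range all {
		name := strings.ToLower(d.Name)
		if len(include) > 0 && !include[name] {
			continue
		}
		if exclude[name] {
			continue
		}
		out = append(out, d)
	}
	return out
}

func main() {
	// Assumed decoder names, for illustration only.
	all := []Decoder{{"plain"}, {"base64"}, {"utf16"}, {"html"}}
	for _, d := range filterDecoders(all, "", "html") {
		fmt.Println(d.Name)
	}
}
```

With a toggle along these lines, a costly decoder could be switched off for data sources that are unlikely to contain HTML-encoded content.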

@rgmz

rgmz commented Mar 21, 2024

I do worry about the impact of having too many decoders. At a minimum, having something like ahocorasick might be more efficient than checking regexp.Match() against each chunk.
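To make the cost concern concrete, here is a minimal sketch of gating the regex behind a cheaper check. It swaps in a plain byte scan (`bytes.IndexByte`) rather than Aho-Corasick so that it needs only the standard library; the `entityPat` pattern and both helper names are assumptions for illustration, not the PR's code.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// entityPat is an assumed pattern for named, decimal, and hex entities.
var entityPat = regexp.MustCompile(`&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);`)

// mayContainEntity is a cheap prefilter: any entity needs an '&'
// followed somewhere later by a ';'.
func mayContainEntity(chunk []byte) bool {
	i := bytes.IndexByte(chunk, '&')
	if i < 0 {
		return false
	}
	return bytes.IndexByte(chunk[i+1:], ';') >= 0
}

// hasEntity runs the regex only on chunks that pass the prefilter.
func hasEntity(chunk []byte) bool {
	return mayContainEntity(chunk) && entityPat.Match(chunk)
}

func main() {
	fmt.Println(hasEntity([]byte("a=1&amp;b=2")))
	fmt.Println(hasEntity([]byte("a=1&b=2")))
	fmt.Println(hasEntity([]byte("no entities here")))
}
```

Chunks with no `&` at all are rejected after a single scan, which is the common case for most non-HTML data; whether this beats a multi-pattern matcher would need benchmarking.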

> One potential improvement might be to implement this as a handler and do identification of the whole file before decoding and chunking it out.

While I think identifying the mimetype of a file would be a great addition (and make way for other enhancements), I'm not sure how much it would help in this case. HTML, Markdown, and AsciiDoc files are obviously sources that would benefit, but HTML-encoded content can show up in weird places like config files, .txt files, or source code.

This decoder was actually inspired by #1550; I found several live connection strings that were not detected by TruffleHog because they contained an encoded &amp; instead of a literal &.

mongodb://dave:password@localhost:27017/?authMechanism=DEFAULT&amp;authSource=db&amp;ssl=true"
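For reference, a minimal sketch of the decoding step itself, using the Go standard library's `html.UnescapeString` on a connection string like the one above with its ampersands HTML-encoded. The `decodeEntities` wrapper is a hypothetical name, and this is not necessarily how the PR's decoder is implemented.

```go
package main

import (
	"fmt"
	"html"
)

// decodeEntities is a hypothetical helper that replaces HTML entities
// such as &amp; with the characters they stand for.
func decodeEntities(s string) string {
	return html.UnescapeString(s)
}

func main() {
	encoded := "mongodb://dave:password@localhost:27017/?authMechanism=DEFAULT&amp;authSource=db&amp;ssl=true"
	fmt.Println(decodeEntities(encoded))
}
```

After decoding, the query string uses literal `&` separators, which is the form a detector's pattern would normally expect.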

@rgmz rgmz marked this pull request as draft April 14, 2024 14:21
@rgmz rgmz force-pushed the feat/html-decoder branch from 79050b1 to 2bb2410 Compare June 5, 2024 00:41
@rgmz rgmz force-pushed the feat/html-decoder branch 2 times, most recently from 721ba1d to 0512f94 Compare June 21, 2024 02:54
@rgmz rgmz force-pushed the feat/html-decoder branch 2 times, most recently from 6a98dcc to 1180b27 Compare July 1, 2024 18:38
@rgmz rgmz force-pushed the feat/html-decoder branch 3 times, most recently from 729714d to d612f5b Compare November 8, 2024 14:01
@rgmz rgmz force-pushed the feat/html-decoder branch from d612f5b to 4df5b0e Compare November 11, 2024 19:22
@rgmz rgmz force-pushed the feat/html-decoder branch 3 times, most recently from ca46f5d to 45eb1ed Compare December 2, 2024 14:01
@rgmz rgmz marked this pull request as ready for review December 2, 2024 14:02
@rgmz rgmz requested review from a team as code owners December 2, 2024 14:02
@rgmz rgmz force-pushed the feat/html-decoder branch 2 times, most recently from 3676e9b to cb4c962 Compare December 21, 2024 16:06
@rgmz rgmz force-pushed the feat/html-decoder branch from cb4c962 to 6083804 Compare December 25, 2024 16:22
@rgmz rgmz force-pushed the feat/html-decoder branch from 6083804 to ac02868 Compare December 31, 2024 14:57
Development

Successfully merging this pull request may close these issues.

Create a decoder for HTML entites
2 participants